Use mozinor for classification

Import the main module


In [1]:
from mozinor.baboulinet import Baboulinet


/home/jwuthri/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Prepare the pipeline

Baboulinet takes the following parameters (a fuller call using them is sketched after this list):

(str) filepath: path to the csv file
(str) y_col: the column to predict
(bool) regression: regression or classification?
(bool) process: (WARNING) apply some preprocessing to your data (tune this preprocessing with the params below)
(char) sep: delimiter
(list) col_to_drop: columns you do not want to use in the prediction
(bool) derivate: for every combination of features, derive new features (n1 * n2, n1 / n2, ...)
(bool) transform: for every feature, apply log(n), sqrt(n), square(n)
(bool) scaled: scale the data?
(bool) infer_datetime: check the type of every column and, if it is a date, build new columns from it (day, month, year, time)
(str) encoding: data encoding
(bool) dummify: create dummies for your categorical variables
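
For reference, a fuller call using these options might look like the sketch below; the values are illustrative choices, not mozinor's defaults.

# Sketch only (not an executed cell) -- the values below are illustrative, not mozinor's defaults:
from mozinor.baboulinet import Baboulinet

cls = Baboulinet(
    filepath="toto.csv",    # csv file to read
    y_col="predict",        # column to predict
    regression=False,       # classification task
    process=True,           # apply the preprocessing tuned by the params below
    sep=",",                # csv delimiter
    col_to_drop=[],         # columns to leave out of the prediction
    derivate=False,         # pairwise feature combinations (n1 * n2, n1 / n2, ...)
    transform=False,        # per-feature log / sqrt / square transforms
    scaled=True,            # scale the data
    infer_datetime=False,   # derive day/month/year/time columns from date columns
    encoding="utf-8",       # data encoding
    dummify=True,           # dummies for categorical variables
)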

The data files have been generated with sklearn.datasets.make_classification
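
For example, a file with the shape used here (10,000 rows, feature columns a through h, and a predict target, matching the (10000, 9) shape logged below) can be built roughly as follows; the exact generation parameters of toto.csv are not known.

# Sketch: build a toto.csv-like file; the exact parameters used for the
# original file are not known.
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=8, random_state=0)
df = pd.DataFrame(X, columns=list("abcdefgh"))
df["predict"] = y
df.to_csv("toto.csv", index=False)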


In [2]:
cls = Baboulinet(filepath="toto.csv", y_col="predict", regression=False)

Now run the pipeline

This may take some time

In [3]:
res = cls.babouline()


Reading the file toto.csv
Read csv file: toto.csv
args: {'encoding': 'utf-8-sig', 'sep': ',', 'decimal': ',', 'engine': 'python', 'filepath_or_buffer': 'toto.csv', 'thousands': '.', 'parse_dates': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'predict'], 'infer_datetime_format': True}
Inital dtypes is a          float64
b          float64
c          float64
d          float64
e          float64
f          float64
g          float64
h          float64
predict      int64
dtype: object
Work on PolynomialFeatures: degree 1
Optimal number of clusters
(10000, 9)

    Polynomial Features: generate a new feature matrix
    consisting of all polynomial combinations of the features.
    For 2 features [a, b]:
        the degree 1 polynomial gives [a, b]
        the degree 2 polynomial gives [1, a, b, a^2, ab, b^2]
    ...


    ELBOW: explain the variance as a function of clusters.

Optimal number of trees
    OOB: this is the average error for each training observation,
    calculated using the trees that do not contain this observation
    during the creation of the tree.

Estimator ExtraTreesClassifier
    ExtraTreesClassifier: as in random forests, a random subset of candidate
    features is used, but instead of looking for the most discriminative
    thresholds, thresholds are drawn at random for each candidate feature and
    the best of these randomly-generated thresholds is picked as
    the splitting rule.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    9.1s finished
   Best params => {'n_estimators': 100, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_features': 0.6, 'criterion': 'entropy', 'bootstrap': False}
   Best Score => 0.865
Estimator XGBClassifier
    Gradient boosting is an approach where new models are created that predict
    the residuals or errors of prior models and then added together to make
    the final prediction. It is called gradient boosting because it uses a
    gradient descent algorithm to minimize the loss when adding new models.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   38.4s finished
   Best params => {'subsample': 0.9, 'n_estimators': 50, 'min_child_weight': 6, 'max_depth': 8, 'learning_rate': 0.5}
   Best Score => 0.855
Estimator KNeighborsClassifier
    KNeighborsClassifier: Majority vote of its k nearest neighbors.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    4.1s finished
   Best params => {'n_neighbors': 17, 'p': 2, 'weights': 'distance'}
   Best Score => 0.853
Estimator DecisionTreeClassifier
    Decision Tree Classifier: poses a series of carefully crafted questions
    about the attributes of the test record. Each time it receives an answer,
    a follow-up question is asked until a conclusion about the class label
    of the record is reached.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.0s finished
   Best params => {'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 10, 'criterion': 'entropy'}
   Best Score => 0.750
Check the decision tree: 2017-08-1813:13:19.847449.png
Work on PolynomialFeatures: degree 2
Optimal number of clusters
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.880171 to fit


    Polynomial Features: generate a new feature matrix
    consisting of all polynomial combinations of the features.
    For 2 features [a, b]:
        the degree 1 polynomial gives [a, b]
        the degree 2 polynomial gives [1, a, b, a^2, ab, b^2]
    ...


    ELBOW: explain the variance as a function of clusters.

Optimal number of trees
    OOB: this is the average error for each training observation,
    calculated using the trees that do not contain this observation
    during the creation of the tree.

Estimator ExtraTreesClassifier
    ExtraTreesClassifier: as in random forests, a random subset of candidate
    features is used, but instead of looking for the most discriminative
    thresholds, thresholds are drawn at random for each candidate feature and
    the best of these randomly-generated thresholds is picked as
    the splitting rule.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   17.0s finished
   Best params => {'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 0.1, 'criterion': 'gini', 'bootstrap': False}
   Best Score => 0.857
Estimator XGBClassifier
    Gradient boosting is an approach where new models are created that predict
    the residuals or errors of prior models and then added together to make
    the final prediction. It is called gradient boosting because it uses a
    gradient descent algorithm to minimize the loss when adding new models.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  1.3min finished
   Best params => {'subsample': 0.9, 'n_estimators': 100, 'min_child_weight': 7, 'max_depth': 4, 'learning_rate': 0.5}
   Best Score => 0.857
Estimator KNeighborsClassifier
    KNeighborsClassifier: Majority vote of its k nearest neighbors.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:   36.3s finished
   Best params => {'n_neighbors': 11, 'p': 2, 'weights': 'distance'}
   Best Score => 0.853
Estimator DecisionTreeClassifier
    Decision Tree Classifier: poses a series of carefully crafted questions
    about the attributes of the test record. Each time it receives an answer,
    a follow-up question is asked until a conclusion about the class label
    of the record is reached.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    7.5s finished
   Best params => {'min_samples_split': 7, 'min_samples_leaf': 8, 'max_depth': 6, 'criterion': 'gini'}
   Best Score => 0.738
Check the decision tree: 2017-08-1813:18:56.364832.png
                                           Estimator     Score  Degree
0  (ExtraTreeClassifier(class_weight=None, criter...  0.864667       1
1  XGBClassifier(base_score=0.5, colsample_byleve...  0.856800       2
2  (ExtraTreeClassifier(class_weight=None, criter...  0.856667       2
3  XGBClassifier(base_score=0.5, colsample_byleve...  0.855333       1
4  KNeighborsClassifier(algorithm='auto', leaf_si...  0.853333       1
5  KNeighborsClassifier(algorithm='auto', leaf_si...  0.852933       2
6  DecisionTreeClassifier(class_weight=None, crit...  0.750400       1
7  DecisionTreeClassifier(class_weight=None, crit...  0.737867       2
    Stacking: is a model ensembling technique used to combine information
    from multiple predictive models to generate a new model.

task:   [classification]
metric: [accuracy_score]

model 0: [ExtraTreesClassifier]
    ----
    MEAN:   [0.86173333]

model 1: [XGBClassifier]
    ----
    MEAN:   [0.84853333]

model 2: [KNeighborsClassifier]
    ----
    MEAN:   [0.86053333]

model 3: [DecisionTreeClassifier]
    ----
    MEAN:   [0.75666667]

Stacking 4 models: 100%|██████████| 15/15 [00:21<00:00,  1.63s/it]
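
The stacking step above trains a second-level model on the out-of-fold predictions of the tuned first-level models. Mozinor does this internally; the idea can be sketched with plain scikit-learn (illustrative only, not mozinor's implementation), reusing some of the degree-1 parameters reported above.

# Conceptual sketch of stacking with scikit-learn -- not mozinor's internal code.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("toto.csv")
X, y = df.drop(columns=["predict"]), df["predict"]

# First-level models (hyperparameters echo the tuned degree-1 results above).
first_level = [ExtraTreesClassifier(n_estimators=100, criterion="entropy"),
               KNeighborsClassifier(n_neighbors=17, weights="distance")]

# Their out-of-fold predictions become the features of the second-level model.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=3) for model in first_level
])

second_level = DecisionTreeClassifier(max_depth=10)
second_level.fit(meta_features, y)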

The class instance now contains two objects: the best model for this data and the best stacking for this data

To auto-generate the code of the models

Generate the code for the best model


In [4]:
cls.bestModelScript()


Check script file toto_solo_model_script.py
Out[4]:
'toto_solo_model_script.py'

Generate the code for the best stacking


In [5]:
cls.bestStackModelScript()


Check script file toto_stack_model_script.py
Out[5]:
'toto_stack_model_script.py'

To check which model is the best

Best model


In [6]:
res.best_model


Out[6]:
Estimator    (ExtraTreeClassifier(class_weight=None, criter...
Score                                                 0.864667
Degree                                                       1
Name: 0, dtype: object

In [7]:
show = """
    Model: {},
    Score: {}
"""
print(show.format(res.best_model["Estimator"], res.best_model["Score"]))


    Model: ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='entropy',
           max_depth=None, max_features=0.6, max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=4, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
    Score: 0.8646666666666667

Best stacking


In [8]:
res.best_stack_models


Out[8]:
Fit1stLevelEstimator    [(ExtraTreeClassifier(class_weight=None, crite...
Fit2ndLevelEstimator    DecisionTreeClassifier(class_weight=None, crit...
Score                                                              0.8736
Degree                                                                  1
Name: 0, dtype: object

In [9]:
show = """
    FirstModel: {},
    SecondModel: {},
    Score: {}
"""
print(show.format(res.best_stack_models["Fit1stLevelEstimator"], res.best_stack_models["Fit2ndLevelEstimator"], res.best_stack_models["Score"]))


    FirstModel: [ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='entropy',
           max_depth=None, max_features=0.6, max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=4, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False), XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=8,
       min_child_weight=6, missing=None, n_estimators=50, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.9), KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=17, p=2,
           weights='distance')],
    SecondModel: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
    Score: 0.8736
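
To reuse the winning configuration outside of mozinor, the generated scripts above are the intended route. Assuming the stored Estimator is a regular scikit-learn estimator (as its repr suggests), it can also be refit by hand; note that this sketch skips mozinor's own preprocessing (polynomial features, scaling, ...).

# Sketch: refit the best single-model configuration on the raw csv.
# Assumes res.best_model["Estimator"] is a scikit-learn estimator and skips
# mozinor's preprocessing, so scores may differ from those reported above.
import pandas as pd
from sklearn.base import clone

df = pd.read_csv("toto.csv")
X, y = df.drop(columns=["predict"]), df["predict"]

best = clone(res.best_model["Estimator"])   # fresh, unfitted copy with the tuned params
best.fit(X, y)
predictions = best.predict(X)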